Zope hanging (poss. threads-related)

Summarizing a valuable thread from the zope list, week of 10-15 April 2000:

Marcus Collins [1]
Since increasing the NUMBER_OF_THREADS that Zope uses from the default (4) to 20, Zope has been hanging randomly. A request will be sent off to Zope (whether through Apache with mod_pcgi, or directly through ZServer), but the request is never served up, and Z2.log shows no record of it. No additional CPU is taken by the python process.
Ross Lazarus [2]
A number of users have reported serious stability problems - me included. Apache/FCGI or direct to medusa seems to make no difference. One way of stopping the problem is to run with -t 1 but frankly, it's not a viable option for many situations - single threading zope is painful under any kind of load, but at least it doesn't appear to core dump regularly.
Marcus [3]
I've installed Zope 2.1.6 from source on the same machine, with as close to an identical setup as is possible on a live server. Even after really hammering 2.1.6, though, I've been unable to produce the same hanging behaviour observed in 2.1.3. The only differences, apart from the apache rewrite rules, are that the 2.1.6 install is using GenericUserFolder 1.2.2 (vs. 1.1.0) and SiteAccess 1.0.1 (vs. 1.0.0). I doubt that these products are the cause of the problem.
Pavlos Christoforou [4]
Although I have stopped chasing this bug, my guess, too, is that the problem is with ZServer somewhere. For one, placing a print statement inside the async loop solved one of the problems I had, but I never figured out what was wrong. In the end I gave up and wrote my own python supervisor module (based on the one by Dan Bernstein, author of qmail), which basically restarts Zope when it dies, plus a script to check whether it has hung, in which case it restarts it again.
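The supervisor approach Pavlos describes can be sketched roughly as follows. This is not his module: `supervise`, `is_responsive`, `poll_interval` and `max_restarts` are hypothetical names chosen for illustration, and the health check is left to the caller (for example, an HTTP request with a timeout).

```python
import subprocess
import sys
import time


def supervise(command, is_responsive, poll_interval, max_restarts):
    """Restart `command` whenever it dies or stops responding.

    `is_responsive` is a caller-supplied callable (an assumption of this
    sketch) that returns False when the server looks hung.
    Returns the number of restarts performed.
    """
    restarts = 0
    child = subprocess.Popen(command)
    while True:
        time.sleep(poll_interval)
        died = child.poll() is not None
        hung = not died and not is_responsive()
        if not (died or hung):
            continue  # healthy: keep watching
        if hung:
            child.kill()  # a hung process has to be killed first
        child.wait()  # reap the old process
        if restarts == max_restarts:
            return restarts
        restarts += 1
        child = subprocess.Popen(command)


if __name__ == "__main__":
    # Demo with a stand-in command that exits immediately.
    print(supervise([sys.executable, "-c", "pass"], lambda: True, 0.05, 2))
```

A real deployment would pass the actual Zope start command and would normally not cap the number of restarts; the cap here only keeps the demo finite.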
Tony Rossignol [5]
I have been observing Zope restarts under moderate site load w/ management activity occurring. I have three Linux servers running zope; two w/ RH6.2 and one w/ RH5.2(?). The one on the older version of the OS does not spontaneously restart, both other servers will restart w/ the one designated to receive the /manage activity restarting at least once an hour during moderate use.
Monty Taylor [6]
Well, I can't say for everyone else, but my Solaris installations are surprisingly no problem. I say surprising because the installation on Solaris is a hell of a lot more bubble-gum and string to put it where we want it. (I don't care so much on Linux.)... Also, I was experiencing the problem quite a bit on 2.1.2, then upped to 2.1.4 and it quit. I tried an upgrade to 2.1.6, had problems and went back to 2.1.4 and then it started up again. I know from a previous life as a support person that that is almost useless information, being unreproducible and all, but I'm wondering if the sporadic nature of the problem could be linked to the spoogey install?
Michel Pelletier [7]
I suspect this problem *might* be unrelated to the threadlock discussed so far. In the case of the reported lock, 2 or more threads cause instability. In your case, you report 4 is stable...I think you are running up against the hardwired database connection limit (7) in ZODB. You have more threads than there are connections.
Chris McDonough [8]
I think it's the pool_size parameter in lib/python/ZODB/DB.py in the DB class' __init__ method.
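The hypothesis above (more publishing threads than pooled database connections) can be illustrated with a small, self-contained simulation. This is not ZODB code: a semaphore stands in for the connection pool, `count_starved` is a hypothetical helper, and the acquire timeout exists only so the demo terminates. A real thread waiting on an exhausted pool might block indefinitely.

```python
import threading


def count_starved(num_threads, pool_size, wait_seconds=0.2):
    """Return how many workers could not obtain a 'connection'.

    Every worker that gets a connection keeps it until all workers
    have either obtained one or given up, which mimics long-running
    requests that hold their connections.
    """
    pool = threading.BoundedSemaphore(pool_size)  # stand-in for the pool
    decided = threading.Semaphore(0)  # counts workers that have an outcome
    release = threading.Event()
    starved = []
    starved_lock = threading.Lock()

    def worker(index):
        got_connection = pool.acquire(timeout=wait_seconds)
        if not got_connection:
            with starved_lock:
                starved.append(index)
        decided.release()
        if got_connection:
            release.wait()  # hold the connection until told to let go
            pool.release()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for _ in threads:
        decided.acquire()  # wait until every worker has an outcome
    release.set()
    for thread in threads:
        thread.join()
    return len(starved)
```

With the figures from the thread, `count_starved(4, 7)` returns 0 and `count_starved(20, 7)` returns 13: with the default 4 threads nobody waits, while with 20 threads only 7 can hold a connection at a time.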
Marcus [9]
Lately, I've seen reports of Zope restarting when manage_main is used; just to clarify, in my case Zope is not restarting: a single thread is hanging (deadlocked for some reason, perhaps?). This occurs in manage_ as well as on our live site, but only a single thread is affected at a time (other users could still log in while this thread was in its quiescent state).
Pavlos Christoforou [10]
I don't think the threading problem is in the asyncore loop, but putting the print statement in there solved another very weird problem I had. I suppose any delay at that point would have solved the dying parent problem, but I am not sure why. Still, my guess is that the problem is in ZServer somewhere ....
Marcus Collins [11]
Our live server, running Zope 2.1.3 with 4 threads, has just had a thread hang. There has been no management activity on this server since last it was restarted about four hours ago. At the time of the hanging, a frameset was being requested (cf. my original post)....A moment ago, I thought this might have something to do with the way ZServer handles pipelined requests, but the problem exists with PCGI as well as HTTP....As mentioned, I've been testing Zope 2.1.6 under similar conditions, but with 20 threads. So far, I haven't been able to cause it to hang on the frameset pages. However, I have caused a single thread to hang by repeatedly refreshing a page (the method concerned happens to call an external method, if that adds relevant info.)
Tony Rossignol [12]
The /manage screens also have the potential for the client to send multiple requests....And the problem also exists with FastCGI....I also hacked z2.py as well as FCGIServer.py to get Amos's analyser script to work. I found no specific method, request or mix of requests consistently incomplete. It almost appeared the larger the resulting page the greater the likelihood of it being incomplete, which would make sense from a purely statistical perspective.
Marcus Collins [13]
As a twist to this story, it would be interesting to know if those folk who experienced frequent Zope restarts when accessing management screens experienced the same problems using just manage_workspace or ftp (thus bypassing multiple simultaneous requests from the client). I guess what I'm driving at is that this is some kind of timing issue.
Jerome Alet [14]
I've noticed something here, but it doesn't seem thread related since all zope hangs...from time to time (seems to be random), when using the management interface, zope completely hangs when using IE5 under NT4SP5. With IE4 on the same machine (before the update) or Netscape 4.72 under Linux (same machine in dual boot) there's no problem at all. It seems (not sure) the hang comes when an error occurs, and errors often occur when you begin with zope...
Marcus [15]
replied with a list of useful questions for DiagnosingHangProblems.
Jerome [16]
reaffirmed the correlation with Internet Explorer 5.
Marcus [17]
I'm convinced there's some deep, dark, timing problem, and that it's thread-related...This morning, I had a thread hang, again on Zope 2.1.3 running with four threads. I was watching for unfinished requests at the time, and caught this twenty minutes after it occurred....I tried viewing a frameset page on the site. No problem. Then I tried launching /manage, and suddenly all the threads hung. This is the first time all the threads have hung. It's also the first time I've managed to catch an unfinished request and try viewing framesets within half an hour, which leads me to believe that, previously, ZServer was doing a good job of cleaning up the hung zombie threads (recall the zombie_timeout of 30 minutes).
Amos Latteier [18]
The ZServer zombie stuff is to get rid of zombie client connections, not zombie publishing threads. These are quite different beasts...In my experience when a Zope publishing thread hangs it's almost always a problem with the published resource. Maybe there's something that puts Zope in a loop that never exits, or maybe there's some DA weirdness that hangs the thread....My advice is to try and identify which requests hang using debug logging and examine the resources that those requests use.
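Amos's advice amounts to pairing each "request started" log entry with its "request finished" entry and looking at what is left over. A minimal sketch of that idea follows, assuming a simplified and hypothetical log format in which `B <id> <method> <path>` marks the beginning of a request and `E <id>` marks its end. The actual debug log format may differ, and `unfinished_requests` is an illustrative name, not the analyser script mentioned in the thread.

```python
def unfinished_requests(log_lines):
    """Return {request_id: description} for requests that began but never ended.

    Assumes the hypothetical line format described above:
    'B <id> <method> <path>' for begin, 'E <id>' for end.
    """
    pending = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        code, request_id = fields[0], fields[1]
        if code == "B":
            pending[request_id] = " ".join(fields[2:])
        elif code == "E":
            pending.pop(request_id, None)
    return pending


sample_log = [
    "B 1 GET /index_html",
    "B 2 GET /manage",
    "E 1",
]
print(unfinished_requests(sample_log))  # {'2': 'GET /manage'}
```

Any request still pending long after it started is a candidate for the hung thread, and the resource it names is the one to examine.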
Tony [19]
We have noticed that once restarts start, they get worse under load. I've been suspecting that the longer pages take to load, the more people are just stopping the page load, and this would/could create a zombie. The problem is I don't know how to identify or research what is going on under the hood. Could someone terminating a connection be an issue here? I mean, a request that is queued waiting for a thread might act differently than a request that is being processed by Zope and waiting on a DB query, or even one terminated once zope has already started passing results back through the pipe. Our restarts are so hard to tie down I'm guessing it's a very subtle issue or a combination of just the wrong factors.
Marcus [20]
The specific resources on which this hanging has occurred are:
  • /manage_*, but not yet when using manage_main alone (again, perhaps a statistical probability)
  • a frameset page on our site, which uses:
    • ZSQL methods calling a MySQL database (MySQL 3.22.27/32; ZMySQLDA 1.1.3 patched with MySQLdb 0.1.0; db connection pool_size equal to NUMBER_OF_THREADS)
    • GUF 1.2.2 (using the database)
    • SiteAccess 1.0.1.

    A single thread at a time hangs. Recently, however, when /manage was accessed shortly after a thread hung, ZServer became completely unresponsive to http or pcgi (didn't check ftp...).

Marcus [21]
You'll sometimes note on the console or your logs something like the following

Example:

   2000-04-14T14:00:26 ERROR(200) ZServer uncaptured python exception, \
     closing channel <PCGIChannel at 87567b0> \
     socket.error:(32,'Broken pipe')
   [!/usr/local/Zope-2.1.6-src/ZServer/medusa/asynchat.py|initiate_send|211]
   [!/usr/local/Zope-2.1.6-src/ZServer/medusa/asyncore.py|send|237])

This occurs (I surmise) when the client closes the channel before ZServer has sent its response. I presume the fact that it closes the channel would mean no zombie, but I'd like to know more about this....We have quite a number of the above errors occurring, and they seem to correlate to people terminating the connection (or browser timeout), from what I've seen internally. I also feel at a loss here, focussing on a very small part of ZServer and possibly missing the big picture.

-- *Note that the log message Marcus reports is the same as Collector issue #1156. Tres*

Michel [22]?
What the Zombie timeout means is that after a publishing thread gets done answering a request, the socket may not go away. This may happen for a number of reasons, e.g., the client hung and is not putting down the phone after the conversation is over (so to speak) or network troubles may prevent the connection from closing properly. This means that there is a zombie connection lying around. This zombie will probably end up going away on its own, but if not, ZServer will kill it after a period of time....The only resource lying around during the life of a Zombie is a tiny little unused open socket; the Mack truck of a Zope thread that served the request for the zombie socket does not hang for that entire period of time, but goes on after it has completed the request to serve other requests....Amos is correct in that these problems are almost always at the Application level, and not at the ZServer level. The fact that Pavlos can prevent hanging by inserting a print statement in the asyncore loop is suspicious, but we do not have enough information yet to point fingers anywhere.
Tres [23]
Here are a couple of commonalities we need to look at:
  • Internet Explorer 5 seems to be involved.
  • The management interface seems to trigger the problem (no application-specific code involved, just manipulation of "stock" Zope objects).
  • Framesets may be involved.
  • Perhaps premature request cancellation (likely to happen often in the management interface) is involved.

I would like to see a description of the interaction between the ZServer-async-select() thread and the Zope threads to which it dispatches requests, particularly what happens when a Zope thread tries to write to a connection which has been closed by the client without a "clean" shutdown (e.g., attempting to write to the socket will fail with errno == EPIPE)....I have a couple of testcases (see FramesetHangTests) for this problem. (Un)fortunately, I don't use IE5 (and don't go with girls who do! :), and therefore they all work without a hitch. Anyone care to try it on their setup with IE5 as the client?
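The failure mode described here, writing to a connection the peer has already closed, can be reproduced in isolation. The sketch below is not ZServer code: it uses a local socket pair as a stand-in for the client connection, and `write_after_peer_close` is a hypothetical helper. On Unix-like systems the write is expected to fail with EPIPE ("Broken pipe", as in the log excerpt above), though some platforms may report ECONNRESET instead.

```python
import errno
import socket


def write_after_peer_close():
    """Write to a socket whose peer has gone away; return the errno seen.

    Returns None if the write unexpectedly succeeds.
    """
    server_side, client_side = socket.socketpair()
    client_side.close()  # the "client" leaves before the response is sent
    try:
        server_side.sendall(b"HTTP/1.0 200 OK\r\n\r\n")
    except OSError as exc:
        return exc.errno
    finally:
        server_side.close()
    return None


result = write_after_peer_close()
print(errno.errorcode.get(result, result))
```

The point of the sketch is that the error surfaces in whichever thread performs the write, so how that thread handles the exception decides whether it carries on or gets stuck.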

References

[1] http://lists.zope.org/pipermail/zope/2000-April/023608.html

[2] http://lists.zope.org/pipermail/zope/2000-April/023639.html

[3] http://lists.zope.org/pipermail/zope/2000-April/023666.html

[4] http://lists.zope.org/pipermail/zope/2000-April/023697.html

[5] http://lists.zope.org/pipermail/zope/2000-April/023705.html

[6] http://lists.zope.org/pipermail/zope/2000-April/023716.html

[7] http://lists.zope.org/pipermail/zope/2000-April/023726.html

[8] http://lists.zope.org/pipermail/zope/2000-April/023732.html

[9] http://lists.zope.org/pipermail/zope/2000-April/023742.html

[10] http://lists.zope.org/pipermail/zope/2000-April/023778.html

[11] http://lists.zope.org/pipermail/zope/2000-April/023779.html

[12] http://lists.zope.org/pipermail/zope/2000-April/023783.html

[13] http://lists.zope.org/pipermail/zope/2000-April/023827.html

[14] http://lists.zope.org/pipermail/zope/2000-April/023841.html

[15] http://lists.zope.org/pipermail/zope/2000-April/023843.html

[16] http://lists.zope.org/pipermail/zope/2000-April/023847.html

[17] http://lists.zope.org/pipermail/zope/2000-April/023862.html

[18] http://lists.zope.org/pipermail/zope-dev/2000-April/004194.html

[19] http://lists.zope.org/pipermail/zope-dev/2000-April/004214.html

[20] http://lists.zope.org/pipermail/zope-dev/2000-April/004215.html

[21] http://lists.zope.org/pipermail/zope-dev/2000-April/004216.html

[23] http://lists.zope.org/pipermail/zope-dev/2000-April/004229.html

[24] http://lists.zope.org/pipermail/zope-dev/2000-April/004235.html